The number of websites with duplicate content on the web has increased tremendously. And more often than you can imagine, this content does not come from other websites but lives on the same domain of the same website. This article looks at what duplicate content is all about, what canonicalization means, and how to use free tools to deal with the issue. While this post is about duplicates in general, the ones on your own website are more important than those off it, according to ranking experts.
By Google’s definition, “Duplicate content generally refers to substantive blocks of content within or across domains that either completely matches other content or is appreciably similar. Mostly, this is not deceptive in origin.”
For the canonicals, Google’s definition is, “Many sites make the same HTML content or files available via different URLs. […] To gain more control over how your URLs appear in search results…we recommend that you pick a canonical (preferred) URL as the preferred version of the page. You can indicate your preference to Google in a number of ways. We recommend them all, though none of them is required (if you don’t indicate a canonical URL, we’ll identify what we think is the best version).”
There are different kinds of duplicate content issues on a website. First, there are duplicates which affect the whole domain, such as URLs with or without www, with or without a trailing slash /, with or without a file name, and so forth. The best way out of this one is to implement 301 permanent redirects to the preferred version of each page.
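As a rough illustration, here is a minimal sketch of such redirects for an Apache server using mod_rewrite in an .htaccess file. The domain example.com is a placeholder, and the sketch assumes the www version with trailing slashes is the preferred one; adjust it to whichever version you actually want indexed.

```
# Minimal .htaccess sketch (Apache, mod_rewrite enabled); example.com is a placeholder.
RewriteEngine On

# 301-redirect the non-www host to the preferred www host
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

# 301-redirect URLs that lack a trailing slash, unless they point to a real file
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]
```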
The other source of duplicates is dynamic URL parameters set by the CMS used to build the site, where the same page can be reached both through a parameterized URL (for example index.php?option=com_content&view=article&id=12) and through a search-friendly alias. Joomla is one of the worst CMSs for these issues because of the way its URLs are formed.
Finding duplicate content on the website
There are different kinds of duplicate content on a website, and several automated tools can help you get started with finding it. The duplicate content checker PlagSpotter is one of the most effective tools you can use to detect duplicate content on your website and hence optimize better for the search engines.
How to fix duplicate content on the website
There are different techniques you can use to fix duplicate content on a website.
- Setting the preferred version of the website's domain is the most effective way to deal with duplicate content on the website. This means that you expressly tell the search engines which domain, the www or the non-www version of the website, you prefer to have indexed.
- Using Google Webmaster Tools to set the preferred version of the website is another way to solve this problem.
- Using canonical tags in the website's meta data lets the search engines know which URLs to index and give authority to. The advantage of canonicalization is that it is quite easy to implement, and there are several ways to implement it on different CMS platforms, whether it's WordPress, Joomla or CMS Made Simple (a basic example is shown below).
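For illustration, here is a minimal sketch of a canonical tag in plain HTML; the href is a placeholder, and on most CMS platforms an SEO plugin or template setting will generate it for you rather than you editing the markup by hand.

```
<!-- Placed in the <head> of every duplicate version of the page. -->
<!-- The href is a placeholder; point it at the single preferred URL. -->
<link rel="canonical" href="https://www.example.com/preferred-page/" />
```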
Hi
Interestingly, after reading your article I realized that I actually have a major duplicate content issue going on with one of my client sites that wasn’t addressed in your article (or any others that I can find on duplicate content) – for no other reason than it is a ‘really rare case’.
Let me explain…
Initially, most of the page URLs within the site were built in such a manner that every part of the URL beyond the domain (i.e. after the “/”) contained Upper and Lower Case letters:
Example: myclientsite.com/Diesel-Engines
Later on, the webmaster decided to change the URL structure so that everything appeared in lower case:
Example: myclientsite.com/diesel-engines
Naturally enough, this led to TWO different URLs pointing to the same content page (and when you multiply out the number of content pages involved, it amounted to 27% of the entire site).
I wasn’t fazed by it as I assumed 301 redirects would completely solve the problem i.e. myclientsite.com/Diesel-Engines >>> myclientsite.com/diesel-engines
But alas! The particular version of the CMS (Contegro) that this site was built on isn’t capable of creating 301 redirects from upper case URLs to lower case ones. In other words, it doesn’t differentiate between upper and lower case URLs. So, any time someone tries to 301 redirect the upper case URLs to the lower case ones, a 301 loop is created. Not exactly the epitome of a good user experience!
While later versions of this CMS do force upper case URLs to appear as lower case ones, this doesn’t help my client’s problem of duplicate content (as the damage has already been done). For the record, it is disheartening going into Google Analytics and seeing traffic data applicable to both versions of the URLs. Even more disheartening when I find links pointing to them!
Bottom line… if 301 redirects cannot be used to resolve this problem (for the reason I’ve just stated), and Webmaster Tools URL Removal service can’t be used because the offending URLs still point to “live” pages (as opposed to deleted or blocked ones), what can I do? Is this client’s site destined for eternal damnation from Google’s condemnation of duplicate content? NOTE: Ever since the roll-out of Penguin 2.0, the site has slowly been dropping in the rankings. After exhaustive site audits and reviews, the ONLY real issue I can find that is remotely related to what the common understanding of Penguin 2.0 is about, is the issue of duplicate content (i.e. duplicate page titles, descriptions, body copy etc.) applicable to the issues I’ve outlined above.
Love to hear your thoughts … and even better still, “LOVE TO KNOW HOW TO FIX IT!”
Cheers
Bruce
Bruce Smeaton, July 9, 2013 at 2:41 am